feat(query): Implement Vector Index with HNSW Algorithm #18134

Open
wants to merge 5 commits into base: main

b41sh
Member

@b41sh b41sh commented Jun 11, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

This PR introduces a vector index to Databend, leveraging the Hierarchical Navigable Small World (HNSW) algorithm for efficient similarity search.

Key Features:

  • Vector Index with HNSW: Implements a vector index based on the HNSW algorithm, enabling fast and accurate approximate nearest neighbor search on VECTOR data. Creating a vector index requires specifying the following parameters to fine-tune performance and accuracy:

    • m: Controls the maximum number of connections (edges) per node in the HNSW graph. Higher values generally improve recall but increase index size and construction time.
    • ef_construct: Controls the search width during index construction, that is, the number of candidate neighbors considered when each vector is inserted into the graph. Higher values lead to better index quality but increase construction time.
    • distance: Specifies the distance function(s) the index supports. Acceptable values are cosine, l1, and l2. Multiple distance functions can be configured for a single index as a comma-separated list, for example distance='cosine,l1,l2'.
  • Distance Function Support: Provides comprehensive distance metric support for various similarity calculations:

    • cosine_distance: Calculates the cosine distance between vectors, suitable for measuring the angle between vectors and identifying semantic similarity.
    • l1_distance: Calculates the L1 distance (Manhattan distance) between vectors.
    • l2_distance: Calculates the L2 distance (Euclidean distance) between vectors.
  • L1 Distance Function Implementation: As part of this PR, the l1_distance function was implemented to provide a complete set of distance functions.
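
For reference, the three metrics listed above can be sketched in a few lines of plain Python. This is only an illustration of the definitions, not the PR's Rust implementation. It assumes cosine distance is defined as 1 minus cosine similarity, and the function names simply mirror the SQL function names.

```python
import math


def _check_dims(a, b):
    # Both vectors must have the same dimension, as with a VECTOR(n) column.
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def l1_distance(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    _check_dims(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    _check_dims(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Assumed definition: 1 - cosine similarity (0 means same direction)."""
    _check_dims(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(l1_distance([1.0, 2.0, 3.0], [4.0, 6.0, 8.0]))  # 12.0
```

Under this definition, floating-point rounding can push the cosine distance between a vector and itself slightly below zero.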

Implementation Details:

The implementation of the HNSW algorithm is primarily based on modifications to the excellent open-source HNSW implementation from github.com/qdrant/qdrant. We would like to express our sincere gratitude to the Qdrant team for their valuable work, which significantly accelerated the development of this feature.
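
To give a feel for what m and ef_construct control, here is a deliberately simplified sketch of a single-layer proximity graph in Python. It is not the multi-layer HNSW structure, nor the Qdrant-derived code used in this PR. The fixed entry point (node 0), the "keep the m closest links" pruning rule, and the helper names build_graph and search_layer are all assumptions made for illustration.

```python
import heapq
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_layer(vectors, graph, query, entry, ef):
    """Best-first search that keeps the `ef` closest nodes seen so far.

    Returns a list of (distance, node_id) sorted by ascending distance.
    """
    visited = {entry}
    d0 = l2(vectors[entry], query)
    candidates = [(d0, entry)]  # min-heap: nodes still to expand
    best = [(-d0, entry)]       # max-heap (negated): current top-ef results
    while candidates:
        dist, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded node cannot improve a full result set.
        if len(best) >= ef and dist > -best[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(vectors[nb], query)
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current farthest result
    return sorted((-neg_d, n) for neg_d, n in best)


def build_graph(vectors, m, ef_construct):
    """Insert vectors one by one, linking each to at most `m` nearby nodes."""
    graph = {0: []}
    for i in range(1, len(vectors)):
        # ef_construct is the search width used to find link candidates.
        found = search_layer(vectors, graph, vectors[i], 0, ef_construct)
        neighbors = [n for _, n in found[:m]]
        graph[i] = list(neighbors)
        for n in neighbors:
            graph[n].append(i)
            if len(graph[n]) > m:
                # Simplified pruning: keep only the m closest links.
                graph[n].sort(key=lambda j: l2(vectors[n], vectors[j]))
                del graph[n][m:]
    return graph


vectors = [[float(i)] for i in range(20)]
graph = build_graph(vectors, m=4, ef_construct=10)
nearest = search_layer(vectors, graph, [7.2], entry=0, ef=5)
print([node for _, node in nearest])
```

A larger m gives each node more links, which means a bigger index. A larger ef_construct widens the candidate search on every insert, which means a slower build. This matches the trade-offs described for the index parameters.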

Example Usage:

-- Create a table with a vector column and vector index.
CREATE TABLE t(id Int, embedding Vector(128), VECTOR INDEX idx (embedding) m=10 ef_construct=40 distance='cosine,l1,l2') Engine = Fuse;
-- Load the SIFT test dataset (1,000,000 vectors)
COPY INTO t FROM 'fs:///data1/b41sh/sift/sift.csv' FILE_FORMAT = (type = CSV field_delimiter='|');

-- Select the 10 vectors nearest to the query vector
SELECT
    uv.id,
    cosine_distance(uv.embedding, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::vector(128)) AS similarity_score
FROM
    t uv
ORDER BY
    similarity_score ASC
LIMIT 10;

╭────────────────────────────────────╮
│        id       │ similarity_score │
│ Nullable(Int32) │      Float32     │
├─────────────────┼──────────────────┤
│               1 │      0.021507084 │
│               3 │       0.05743867 │
│               7 │       0.07149047 │
│           83607 │       0.10362792 │
│          631204 │       0.11076915 │
│          677835 │       0.11215311 │
│          246711 │       0.11223245 │
│          677794 │       0.11382812 │
│          480593 │       0.11518437 │
│          725638 │      0.116248846 │
╰────────────────────────────────────╯
10 rows read in 0.284 sec. Processed 925.96 thousand rows, 0 B (3.26 million rows/s, 0 B/s) (without cache)
10 rows read in 0.020 sec. Processed 925.96 thousand rows, 0 B (46.3 million rows/s, 0 B/s) (with cache)

-- EXPLAIN shows the vector pruning
EXPLAIN SELECT
    uv.id,
    l1_distance(uv.embedding, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::vector(128)) AS similarity_score
FROM
    t uv
ORDER BY
    similarity_score ASC
LIMIT 10;

-[ EXPLAIN ]-----------------------------------
RowFetch
├── output columns: [uv._vector_score (#2), uv._row_id (#3), uv.id (#0)]
├── columns to fetch: [id]
├── estimated rows: 10.00
└── Limit
    ├── output columns: [uv._vector_score (#2), uv._row_id (#3)]
    ├── limit: 10
    ├── offset: 0
    ├── estimated rows: 10.00
    └── Sort
        ├── output columns: [uv._vector_score (#2), uv._row_id (#3)]
        ├── sort keys: [_vector_score ASC NULLS LAST]
        ├── estimated rows: 8000000.00
        └── TableScan
            ├── table: default.default.t
            ├── output columns: [_vector_score (#2), _row_id (#3)]
            ├── read rows: 976507
            ├── read size: 0
            ├── partitions total: 34
            ├── partitions scanned: 4
            ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 34 to 34, vector pruning: 34 to 4>]
            ├── push downs: [filters: [], limit: 10]
            └── estimated rows: 8000000.00

-- Drop vector index
DROP VECTOR INDEX idx ON t;

-- Select the 10 vectors nearest to the query vector, this time without the vector index
SELECT
    uv.id,
    cosine_distance(uv.embedding, [0.0, 16.0, 35.0, 5.0, 32.0, 31.0, 14.0, 10.0, 11.0, 78.0, 55.0, 10.0, 45.0, 83.0, 11.0, 6.0, 14.0, 57.0, 102.0, 75.0, 20.0, 8.0, 3.0, 5.0, 67.0, 17.0, 19.0, 26.0, 5.0, 0.0, 1.0, 22.0, 60.0, 26.0, 7.0, 1.0, 18.0, 22.0, 84.0, 53.0, 85.0, 119.0, 119.0, 4.0, 24.0, 18.0, 7.0, 7.0, 1.0, 81.0, 106.0, 102.0, 72.0, 30.0, 6.0, 0.0, 9.0, 1.0, 9.0, 119.0, 72.0, 1.0, 4.0, 33.0, 119.0, 29.0, 6.0, 1.0, 0.0, 1.0, 14.0, 52.0, 119.0, 30.0, 3.0, 0.0, 0.0, 55.0, 92.0, 111.0, 2.0, 5.0, 4.0, 9.0, 22.0, 89.0, 96.0, 14.0, 1.0, 0.0, 1.0, 82.0, 59.0, 16.0, 20.0, 5.0, 25.0, 14.0, 11.0, 4.0, 0.0, 0.0, 1.0, 26.0, 47.0, 23.0, 4.0, 0.0, 0.0, 4.0, 38.0, 83.0, 30.0, 14.0, 9.0, 4.0, 9.0, 17.0, 23.0, 41.0, 0.0, 0.0, 2.0, 8.0, 19.0, 25.0, 23.0, 1.0]::vector(128)) AS similarity_score
FROM
    t uv
ORDER BY
    similarity_score ASC
LIMIT 10;

╭─────────────────────────────────────╮
│        id       │  similarity_score │
│ Nullable(Int32) │      Float32      │
├─────────────────┼───────────────────┤
│               1 │ -0.00000011920929 │
│               3 │       0.036221445 │
│               7 │       0.047054827 │
│           83607 │        0.08021164 │
│          631204 │        0.08787441 │
│          677835 │        0.08972484 │
│          246711 │       0.090816796 │
│          677794 │       0.091252804 │
│          480593 │         0.0922364 │
│           10337 │         0.0925557 │
╰─────────────────────────────────────╯
10 rows read in 0.638 sec. Processed 1 million rows, 492.33 MiB (1.64 million rows/s, 809.76 MiB/s) (without cache)
10 rows read in 0.580 sec. Processed 1 million rows, 492.33 MiB (1.64 million rows/s, 807.11 MiB/s) (with cache)
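
The two result sets and the timings above can be compared with a little arithmetic. The Python sketch below uses only the ids and timings shown in this description. The helper names recall_at_k and speedup are illustrative, and the numbers come from a single run, so they are an indication rather than a benchmark.

```python
# Ids returned by the top-10 cosine query with the vector index (approximate).
indexed_ids = [1, 3, 7, 83607, 631204, 677835, 246711, 677794, 480593, 725638]
# Ids returned by the same query after dropping the index (exact scan).
exact_ids = [1, 3, 7, 83607, 631204, 677835, 246711, 677794, 480593, 10337]


def recall_at_k(approx, exact):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx) & set(exact)) / len(exact)


def speedup(baseline_sec, indexed_sec):
    """How many times faster the indexed query ran than the baseline."""
    return baseline_sec / indexed_sec


recall = recall_at_k(indexed_ids, exact_ids)
cold = speedup(0.638, 0.284)   # both runs without cache
warm = speedup(0.580, 0.020)   # both runs with cache
pruned = 1 - 4 / 34            # blocks skipped by vector pruning in the EXPLAIN output

print(f"recall@10 = {recall:.2f}")
print(f"speedup without cache = {cold:.1f}x, with cache = {warm:.1f}x")
print(f"blocks pruned = {pruned:.0%}")
```

In this run the indexed query returns 9 of the 10 exact neighbors and is roughly 2.2x faster without cache and about 29x faster with cache. Note that the EXPLAIN output was produced for the l1_distance query, so the pruning ratio is not directly tied to the cosine timings.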

part of: #17972

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jun 11, 2025
@b41sh b41sh force-pushed the feat-vector-hnsw branch 2 times, most recently from 01352f1 to ec5e851 Compare June 19, 2025 05:36
@b41sh b41sh force-pushed the feat-vector-hnsw branch from ec5e851 to 0b36252 Compare July 10, 2025 03:56
@b41sh b41sh marked this pull request as ready for review July 10, 2025 06:28
@b41sh b41sh requested review from sundy-li and BohuTANG July 10, 2025 06:28
@BohuTANG BohuTANG added the ci-cloud Build docker image for cloud test label Jul 10, 2025
Contributor

Docker Image for PR

  • tag: pr-18134-8bdc45c-1752137107

note: this image tag is only available for internal use.

@b41sh b41sh force-pushed the feat-vector-hnsw branch from 20e788e to 0dc2c04 Compare July 10, 2025 11:25
@BohuTANG BohuTANG added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jul 10, 2025
Contributor

Docker Image for PR

  • tag: pr-18134-5437daf-1752156992

note: this image tag is only available for internal use.

@b41sh b41sh force-pushed the feat-vector-hnsw branch from 5b2cc3f to 322727c Compare July 13, 2025 06:17
@BohuTANG BohuTANG added ci-cloud Build docker image for cloud test and removed ci-cloud Build docker image for cloud test labels Jul 13, 2025
Contributor

Docker Image for PR

  • tag: pr-18134-b4885b7-1752392120

note: this image tag is only available for internal use.

@BohuTANG
Member

From my test, almost works 👍

Labels
ci-cloud Build docker image for cloud test pr-feature this PR introduces a new feature to the codebase